Content-Adaptive Non-Local Convolution for Remote Sensing Pansharpening


Yule Duan, Xuao Wu, Haoyu Deng, Liang-Jian Deng

University of Electronic Science and Technology of China, China

Abstract

Currently, machine learning-based methods for remote sensing pansharpening have progressed rapidly. However, existing pansharpening methods often do not fully exploit differentiating regional information in non-local spaces, thereby limiting the effectiveness of the methods and resulting in redundant learning parameters.

In this paper, we introduce a so-called content-adaptive non-local convolution (CANConv), a novel method tailored for remote sensing image pansharpening. Specifically, CANConv employs adaptive convolution, ensuring spatial adaptability, and incorporates non-local self-similarity through the similarity relationship partition (SRP) and the partition-wise adaptive convolution (PWAC) sub-modules.

Furthermore, we also propose a corresponding network architecture, called CANNet, which mainly utilizes the multi-scale self-similarity. Extensive experiments demonstrate the superior performance of CANConv, compared with recent promising fusion methods. Besides, we substantiate the method’s effectiveness through visualization, ablation experiments, and comparison with existing methods on multiple test sets. The source code is publicly available at https://github.com/duanyll/CANConv

Background

  • Pansharpening
  • Prior arts

Pansharpening

Images captured by remote sensing satellites:

  • PAN: high-resolution PANchromatic images
  • LRMS: Low-Resolution Multi-Spectral images

We hope to obtain High-Resolution Multi-Spectral (HRMS) images by fusing the two sources (the "+" below denotes fusion rather than pixel-wise addition):

\mathrm{PAN} + \mathrm{UpSample}(\mathrm{LRMS}) = \mathrm{HRMS}
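
A rough sketch of the tensors involved. The eight-band setting, image sizes, and 4× resolution ratio below are illustrative assumptions for a WorldView-3-like scene, not values prescribed by the paper:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes for a WorldView-3-like scene (8 spectral bands,
# 4x resolution ratio between PAN and LRMS).
pan  = torch.rand(1, 1, 256, 256)   # high-resolution panchromatic image
lrms = torch.rand(1, 8, 64, 64)     # low-resolution multi-spectral image

# Upsample LRMS to the PAN resolution, then fuse the two to predict HRMS.
ms_up = F.interpolate(lrms, scale_factor=4, mode="bicubic", align_corners=False)
# A pansharpening network takes (pan, ms_up) and outputs the fused HRMS:
# hrms = model(pan, ms_up)          # expected shape: (1, 8, 256, 256)
```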

Cont’d

High-resolution panchromatic (PAN) image

Cont’d

Low-resolution multi-spectral (LRMS) image

Cont’d

High-resolution multi-spectral (HRMS) image, fused by a traditional (non-deep-learning) pansharpening algorithm

Prior arts for the task of pansharpening

  • Traditional
  • CNN-based
    • Global adaptive / standard convolution: Dynamic Filter Network (DFN) [1]
    • Spatial adaptive convolution: Pixel-adaptive convolutional neural networks (PAC) [2]
    • Graph convolution: IGNN [3]

[1] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic Filter Networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2016.

[2] Hang Su, V. Jampani, Deqing Sun, Orazio Gallo, Erik G. Learned-Miller, and Jan Kautz. Pixel-adaptive convolutional neural networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11158–11167, 2019.

[3] Shangchen Zhou, Jiawei Zhang, Wangmeng Zuo, and Chen Change Loy. Cross-Scale Internal Graph Neural Network for Image Super-Resolution, 2020.

Cont’d

Comparison

Motivation

Remote sensing images consist of regions (segments) sharing the same semantics, and these regions can span very wide areas.

An example of a remote sensing image containing many similar non-local regions

Previous works use kNN to capture non-local features in an image; in practice, the k nearest neighbors are far from sufficient, and increasing k introduces heavy and redundant computational overhead. This paper instead adopts a clustering approach, in which pixels sharing similar features are grouped into sets.

Methods

  • Build Similarity Relationship Partition (SRP)
  • Partition-Wise Adaptive Convolution (PWAC)
    • centroid of partition
    • generate convolution kernels adaptively
  • Replace ResBlocks with CAN-ResBlocks

Similarity Relationship Partition (SRP)

Input feature map: X \in \mathbb{R}^{\overbrace{H}^{\text{height}} \times \overbrace{W}^\text{width} \times \overbrace{C^\text{in}}^\text{input channels}}

For pixel X_{xy} (a vector), apply spatial mean pooling in a k \times k neighborhood to obtain \boldsymbol{f}_{xy}:

\boldsymbol{f}_{xy} = \frac{1}{k^2}\sum_{i=-\lfloor\frac{k}{2}\rfloor}^{\lfloor\frac{k}{2}\rfloor}\sum_{j=-\lfloor\frac{k}{2}\rfloor}^{\lfloor\frac{k}{2}\rfloor}X_{x+i,\,y+j}

Apply k-Means to the pooled features to obtain the cluster index matrix \boldsymbol{I}, where \boldsymbol{I}_{xy} denotes the index of the cluster to which pixel X_{xy} belongs (the number of clusters in k-Means is independent of the neighborhood size k).

Construct SRP:

S_i=\{(x,y)|\boldsymbol{I}_{xy}=i\}
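
A minimal SRP sketch in PyTorch. The function name, the number of clusters, the k-Means iteration count, and the naive clustering loop are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn.functional as F

def similarity_relationship_partition(x, num_clusters=32, k=3, iters=10):
    """Minimal SRP sketch for a single feature map x of shape (1, C, H, W).
    num_clusters and iters are illustrative choices, not the paper's settings."""
    b, c, h, w = x.shape
    assert b == 1, "sketch handles one feature map for clarity"
    # f_xy: spatial mean pooling over the k x k neighborhood of every pixel
    f = F.avg_pool2d(x, kernel_size=k, stride=1, padding=k // 2)   # (1, C, H, W)
    feats = f.flatten(2).squeeze(0).t()                            # (H*W, C)

    # Naive k-Means on the pooled per-pixel features
    centers = feats[torch.randperm(h * w)[:num_clusters]].clone()
    for _ in range(iters):
        idx = torch.cdist(feats, centers).argmin(dim=1)            # cluster assignments
        for j in range(num_clusters):
            mask = idx == j
            if mask.any():
                centers[j] = feats[mask].mean(dim=0)

    # Cluster index matrix I, where I[x, y] is the partition id of pixel (x, y)
    return idx.view(h, w)
```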

Partition-Wise Adaptive Convolution (PWAC)

The core problem is how to construct an adaptive convolution kernel (a matrix) for each partition.

Firstly, find the centroid of each cluster:

\boldsymbol{c}_i = \frac{1}{|S_i|}\sum_{(x,y)\in S_i}\boldsymbol{p}_{xy},

where \boldsymbol{p}_{xy} \in \mathbb{R}^{k^2C_\text{in}} is the unfolded \mathbb{R}^{k \times k \times C_\text{in}} input patch, i.e., the elements of the k \times k area centered at (x, y) (with C_\text{in} channels) rearranged into a vector \boldsymbol{p}_{xy}.

In the following parts, the centroid \boldsymbol{c}_i will be used to represent S_i.

Deal with outlier pixels: if |S_i|<\eta \cdot HW (a threshold), set \boldsymbol{c}_i=\frac{1}{HW}\sum_{(x,y)\in U} \boldsymbol{p}_{xy}, where U is the set of all pixel positions, i.e., fall back to the global mean.
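
A sketch of this centroid step, assuming the cluster index matrix from SRP is available. The function name and the threshold value below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def partition_centroids(x, index, k=3, eta=0.01):
    """Sketch of the centroid step in PWAC: unfold k x k patches into vectors
    p_xy, then average them per partition. eta is the outlier threshold from
    the slide; its value here is an assumption."""
    b, c, h, w = x.shape
    assert b == 1
    # p_xy in R^{C_in * k^2}: unfolded k x k x C_in patch centered at (x, y)
    patches = F.unfold(x, kernel_size=k, padding=k // 2)           # (1, C_in*k*k, H*W)
    patches = patches.squeeze(0).t()                               # (H*W, C_in*k^2)

    flat_idx = index.view(-1)                                      # (H*W,)
    num_clusters = int(flat_idx.max()) + 1
    global_mean = patches.mean(dim=0)                              # fallback for small clusters

    centroids = []
    for i in range(num_clusters):
        mask = flat_idx == i
        if mask.sum() < eta * h * w:
            centroids.append(global_mean)      # outlier partition: use the global mean
        else:
            centroids.append(patches[mask].mean(dim=0))
    return torch.stack(centroids)                                  # (num_clusters, C_in*k^2)
```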

Cont’d

A global kernel parameter \boldsymbol{W} \in \mathbb{R}^{C_\text{in} \times k^2 \times C_\text{out}} is shared across all partitions.

Perceptron: a perceptron maps the centroid \boldsymbol{c}_i to three vectors \boldsymbol{w}_\text{cin} \in \mathbb{R}^{C_\text{in}}, \boldsymbol{w}_\text{s} \in \mathbb{R}^{k^2}, and \boldsymbol{w}_\text{cout} \in \mathbb{R}^{C_\text{out}}.

Adaptive kernel: f_\text{k}(\boldsymbol{c}_i) = (\boldsymbol{w}_\text{cin} ⊛ \boldsymbol{w}_\text{s} ⊛ \boldsymbol{w}_\text{cout}) ⊙ \boldsymbol{W}.

⊛ refers to the Kronecker product, defined by: A_{m \times n}⊛B_{p \times q}=\begin{bmatrix} a_{11}B&a_{12}B&\cdots&a_{1n}B\\ a_{21}B&a_{22}B&\cdots&a_{2n}B\\ \vdots&\vdots&\ddots&\vdots\\ a_{m1}B&a_{m2}B&\cdots&a_{mn}B\\ \end{bmatrix}.

⊙ refers to the element-wise (Hadamard) product. The Kronecker product of the three vectors lies in \mathbb{R}^{C_\text{in}k^2C_\text{out}} and is reshaped to the shape of \boldsymbol{W} before the element-wise product.

Convolution output: \boldsymbol{Y}_{xy} = \boldsymbol{p}_{xy} \otimes f_\text{k}(\boldsymbol{c}_{\boldsymbol{I}_{xy}}) + f_\text{b}(\boldsymbol{c}_{\boldsymbol{I}_{xy}}), where f_\text{b} generates an adaptive bias from the centroid in the same manner.
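
A hedged sketch of the kernel-generation step. The class name, the perceptron's depth, width, and initialization are assumptions; only the Kronecker-style modulation of the shared kernel \boldsymbol{W} and the adaptive bias follow the formulas above:

```python
import torch
import torch.nn as nn

class AdaptiveKernel(nn.Module):
    """Hypothetical sketch of kernel generation in PWAC."""
    def __init__(self, c_in, c_out, k=3, hidden=64):
        super().__init__()
        self.c_in, self.c_out, self.k = c_in, c_out, k
        # Global kernel parameter W in R^{C_in x k^2 x C_out}, shared by all partitions
        self.W = nn.Parameter(torch.randn(c_in, k * k, c_out) * 0.02)
        # Perceptron mapping a centroid c_i to the three modulation vectors
        self.mlp = nn.Sequential(
            nn.Linear(k * k * c_in, hidden), nn.ReLU(),
            nn.Linear(hidden, c_in + k * k + c_out),
        )
        self.bias_mlp = nn.Linear(k * k * c_in, c_out)   # f_b: adaptive bias

    def forward(self, centroid):                         # centroid c_i: (C_in * k^2,)
        w = self.mlp(centroid)
        w_cin, w_s, w_cout = torch.split(w, [self.c_in, self.k * self.k, self.c_out])
        # Outer (Kronecker) product of the three vectors, shaped like W, modulating W
        kernel = torch.einsum("i,s,o->iso", w_cin, w_s, w_cout) * self.W
        kernel = kernel.reshape(self.c_in * self.k * self.k, self.c_out)
        return kernel, self.bias_mlp(centroid)

# Per-pixel output, reusing the unfolded patch p_xy (length C_in * k^2):
#   y_xy = p_xy @ kernel + bias, with the kernel chosen by the pixel's cluster index I_xy
```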

Replace ResBlocks with CAN-ResBlocks

In the proposed CANNet:

  • traditional convolution kernels are replaced by kernels generated adaptively within each SRP partition;
  • traditional convolution modules are replaced with PWACs;
  • the overall network adopts the popular U-Net architecture.

The architecture of CANNet, in which blue arrows represent the flow of cluster index matrices and black ones represent the flow of feature maps.

Experiments

  • Data sets:
    • WorldView-3
    • QuickBird
    • GaoFen-2
  • The model is evaluated on these metrics:
    • \mathrm{SAM}, \mathrm{ERGAS}, \mathrm{Q4/Q8}
    • D_{\lambda}, D_s, \mathrm{HQNR} (Hybrid \mathrm{QNR})
  • Ablation
    • Disable SRP
    • Disable PWAC

Effect


Evaluation

Results on the WorldView-3 data set

Ablation

Ablate SRP by treating the entire feature map as a single cluster or each pixel as its own cluster.

Disable SRP

Ablate PWAC by using a Multi-Layer Perceptron instead.

Disable PWAC

Summary

  • Construct convolution partitions by feature aggregation and clustering
  • Generate convolution kernels adaptively per partition
  • Replace standard convolutions with CANConv within the popular U-Net architecture

BTW: Some ideas in this paper may be helpful in paddy classification.

Appendix

This part lists the definitions of several metrics used in this paper. For more information, see also Multispectral and Panchromatic Data Fusion Assessment Without Reference [4].

  • \mathrm{SAM}
  • \mathrm{ERGAS}
  • \mathrm{Q4}
  • D_\lambda
  • D_s
  • \mathrm{QNR}

[4] Luciano Alparone, Bruno Aiazzi, Stefano Baronti, Andrea Garzelli, Filippo Nencini, and Massimo Selva. Multispectral and Panchromatic Data Fusion Assessment Without Reference. Photogrammetric Engineering and Remote Sensing, 74(2):193-200, 2008. doi:10.14358/PERS.74.2.193.

SAM (Spectral Angle Mapper)

\mathrm{SAM} = \arccos \frac{\vec{x}^\mathsf{T}\vec{y}}{\|\vec{x}\|\|\vec{y}\|},

where \vec x is the test spectrum and \vec y is the reference spectrum. The smaller \mathrm{SAM} is, the higher the probability that \vec x and \vec y correspond to the same type of object.
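
A minimal per-pixel SAM sketch; benchmark code typically averages the angle over all pixels and reports it in degrees:

```python
import numpy as np

def sam(x, y, eps=1e-8):
    """Spectral Angle Mapper between a test spectrum x and a reference spectrum y,
    returned in radians."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```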

ERGAS (Erreur Relative Globale Adimensionnelle de Synthèse)

\mathrm{ERGAS} is a French acronym, meaning Relative Dimensionless Global Error in Synthesis.

\mathrm{ERGAS} = 100\;\frac{d_\text{PAN}}{d_\text{LRMS}}\sqrt{\frac{1}{L}\sum_{l=1}^L\left(\frac{\text{RMSE}_l}{\mu_l}\right)^2}

  • d_\text{PAN}: pixel size of PAN;
  • d_\text{LRMS}: pixel size of LRMS;
  • L: number of bands [5]
  • \mu_l: mean of the lth band
  • \mathrm{RMSE}_l: Root Mean Squared Error of pixels in the lth band

[5] band, i.e., channel
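
A straightforward ERGAS sketch for reduced-resolution evaluation; the resolution ratio of 4 is an illustrative default:

```python
import numpy as np

def ergas(fused, reference, ratio=4):
    """ERGAS sketch for band-stacked images of shape (H, W, L). `ratio` is the
    LRMS-to-PAN pixel-size ratio d_LRMS / d_PAN (4 is an illustrative default),
    so the leading factor 100 * d_PAN / d_LRMS becomes 100 / ratio."""
    fused = np.asarray(fused, dtype=float)
    reference = np.asarray(reference, dtype=float)
    terms = []
    for l in range(reference.shape[-1]):
        rmse = np.sqrt(np.mean((fused[..., l] - reference[..., l]) ** 2))
        mu = reference[..., l].mean()
        terms.append((rmse / mu) ** 2)
    return 100.0 / ratio * float(np.sqrt(np.mean(terms)))
```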

Q4 (QI of Four-Band Images)

\mathrm{QI} means Quality Index.

A pansharpened multi-spectral (MS) image from a four-band sensor (e.g., QuickBird or GaoFen-2) has four bands. The quality index \mathrm{Q4} is a generalization to four-band images of the \mathrm{QI} [6], which can be applied only to monochrome [7] images.

The calculation of \mathrm{QI} can likewise be generalized to n-band images (e.g., \mathrm{Q8} for eight-band WorldView-3 imagery).


[6] Wang, Z., and A.C. Bovik, 2002. A universal image quality index, IEEE Signal Processing Letters, 9(3):81–84.

[7] black-and-white
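
Since Q(\cdot,\cdot) also appears in D_\lambda and D_s below, a global (non-windowed) sketch of the scalar \mathrm{QI} may help; the sliding-window averaging used in practice is omitted for brevity:

```python
import numpy as np

def qi(x, y, eps=1e-8):
    """Universal image Quality Index (QI) of Wang & Bovik [6] between two
    single-band images, computed globally over the full image."""
    x, y = np.asarray(x, dtype=float).ravel(), np.asarray(y, dtype=float).ravel()
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return float(4 * cov * mx * my / ((x.var() + y.var()) * (mx ** 2 + my ** 2) + eps))
```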

Spectral Distortion Index (D_\lambda)

D_\lambda = \sqrt[p]{\frac{1}{L(L-1)}\sum_{l=1}^L\sum_{\substack{r=1\\ r\neq l}}^L\left|Q\left(\hat{G}_l, \hat{G}_r\right) - Q\left(\tilde{G}_l, \tilde{G}_r\right)\right|^p}

  • \left\{\hat{G}_l\right\}_{l=1}^L: bands of the fused (pansharpened) MS image;
  • \left\{\tilde{G}_l\right\}_{l=1}^L: bands of the low-resolution (original) MS image;
  • Q(\cdot,\cdot): the \mathrm{QI} computed between two bands.

Spatial Distortion Index (D_s)

D_s = \sqrt[q]{\frac{1}{L}\sum_{l=1}^L\left|Q\left(\hat{G}_l, P\right) - Q\left(\tilde{G}_l, \tilde{P}\right)\right|^q}

  • P: PAN image;
  • \tilde{P}: spatially degraded PAN, obtained by filtering with a lowpass filter having normalized frequency cutoff at the resolution ratio between MS and PAN, followed by decimation.

Quality with No Reference (\mathrm{QNR})

\mathrm{QNR} is the product of the one's complements of the spectral and spatial distortion indices, each raised to a real-valued exponent that weights the relevance of spectral and spatial distortion in the overall quality.

\mathrm{QNR} = \left(1 - D_\lambda\right)^\alpha\cdot\left(1 - D_s\right)^\beta, \alpha, \beta \in \mathbb{R}
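
A compact sketch of D_\lambda, D_s, and \mathrm{QNR} built on the \mathrm{QI} helper above; the exponents p, q and \alpha = \beta = 1 follow the common defaults, and windowed \mathrm{QI} computation is again omitted:

```python
import numpy as np

def qi(x, y, eps=1e-8):
    # Global QI between two single-band images (restated from the QI sketch above).
    x, y = np.asarray(x, dtype=float).ravel(), np.asarray(y, dtype=float).ravel()
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return float(4 * cov * mx * my / ((x.var() + y.var()) * (mx ** 2 + my ** 2) + eps))

def d_lambda(fused, lrms, p=1):
    """Spectral distortion over band-stacked images of shape (H, W, L)."""
    L = fused.shape[-1]
    total = sum(abs(qi(fused[..., l], fused[..., r]) - qi(lrms[..., l], lrms[..., r])) ** p
                for l in range(L) for r in range(L) if r != l)
    return (total / (L * (L - 1))) ** (1.0 / p)

def d_s(fused, pan, lrms, pan_lr, q=1):
    """Spatial distortion; `pan_lr` is the PAN image degraded to the LRMS scale."""
    L = fused.shape[-1]
    total = sum(abs(qi(fused[..., l], pan) - qi(lrms[..., l], pan_lr)) ** q for l in range(L))
    return (total / L) ** (1.0 / q)

def qnr(fused, pan, lrms, pan_lr, alpha=1.0, beta=1.0):
    """QNR = (1 - D_lambda)^alpha * (1 - D_s)^beta; alpha = beta = 1 is the usual choice."""
    return (1 - d_lambda(fused, lrms)) ** alpha * (1 - d_s(fused, pan, lrms, pan_lr)) ** beta
```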